library(topGO)
library(knitr) # to use kable()
library(limma) # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')
This is the enrichment analysis for isoforms regulated by regime. Because I quantified the association of expression with regime in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with isoforms ordered by the proportion of expression variance explained by regime. Note that some isoforms with low variance may still be highly associated with regime, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of isoforms based on the significance of the differential expression between levels of regime, which does depend on fold change.
Functional annotation is in 2019-07-26. I will also upload two lists of isoforms, with either proportion of variance explained by regime or p-value of differential expression test.
tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))
To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the isoforms identifiers and the numeric values, either the portion of variance explained by regime or the p-values of the differential expression test. The structure() function adds attributes to an object.
Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues <- structure(tagPValue[,1], names = row.names(tagPValue))
rm(tagVariance, tagPValue)
There are 33037 isoforms scored with a variance portion and a differential expression p-value. It should actually be the exact same isoforms. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.
head(annotation)
## tagname
## 1 TCONS_00000002
## 2 TCONS_00000010
## 3 TCONS_00000011
## 4 TCONS_00000016
## 5 TCONS_00000017
## 6 TCONS_00000018
## goterms
## 1 GO:0008417|GO:0016020|GO:0006486
## 2 GO:0043130|GO:0005515|GO:0043161
## 3 GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 6 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)
There are 19735 isoforms with GO annotations. But the differential expression analysis includes many more isoforms. In order to include the not-annotated isoforms in the enrichment analysis, to see if annotated or not annotated isoforms are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.
for (tag in unique(c(names(PValues), names(Variance)))) {
if (! tag %in% names(allgenes2GO)) {
allgenes2GO <- append(allgenes2GO,
structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
}
}
Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.
I generate three datasets, to analyse each of the three ontologies.
selection <- function(allScores) {return(allScores < 0.01)}
GOdata.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
| ontology | Num_Genes | Num_GO_terms |
|---|---|---|
| BP | 26235 | 1582 |
| MF | 30580 | 791 |
| CC | 21616 | 390 |
There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.
BP.elim <- runTest(GOdata.BP, algorithm = "elim", statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea <- runTest(GOdata.BP, algorithm = "lea", statistic = "ks")
MF.elim <- runTest(GOdata.MF, algorithm = "elim", statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea <- runTest(GOdata.MF, algorithm = "lea", statistic = "ks")
CC.elim <- runTest(GOdata.CC, algorithm = "elim", statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea <- runTest(GOdata.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
| ontology | algorithm | TermsTested | Significant |
|---|---|---|---|
| BP | elim | 1582 | 19 |
| BP | weight01 | 1582 | 18 |
| BP | lea | 1582 | 24 |
| MF | elim | 791 | 20 |
| MF | weight01 | 791 | 19 |
| MF | lea | 791 | 29 |
| CC | elim | 390 | 13 |
| CC | weight01 | 390 | 11 |
| CC | lea | 390 | 21 |
rm(ResultsSummary)
The topGO package only looks for GO terms that are overrepresented among the top of the gene list (genes with lowest score, assumed to be a \(p\) value; see 2020-06-30). Thus, underrepresented GO terms are ignored, even though it could be interesting to know if among differentially expressed genes there are fewer genes annotated with certain terms than expected by chance. The way to identify those terms would be to reverse the scores. In this case, running the analysis with \(1 - p\) values would be to search for terms that are overrepresented among non-differentially expressed genes. Not worth pursuing now.
Note that, despite my initial confusion (see previous commits), it is not true that significant terms with fewer significant genes than expected are significantly underrepresented terms. It is just possible that a term is significant because of the overall distribution of scores of genes annotated with that term, even if none of those genes (or just fewer than expected) are significant. This is because of the use of Kolmogorov-Smirnov test, instead of the Fisher’s exact test. Note that even though no hard threshold is necessary in the K-S test, GenTable() produces a summary table with number of annotated and significant genes for every significant GO term. That number of significant term is determined by the selection() function passed to the topGO object, only for display purposes unless using Fisher’s exact test.
orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms
BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
write.table(BP.all, file=paste(TAG, VAR, 'BPsummary.tsv', sep='/'),
quote=FALSE, sep='\t', row.names=FALSE)
kable(BP.all,
caption = "Most over-represented terms of the Biological Process ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0007186 | G protein-coupled receptor signaling pat… | 302 | 17 | 23.25 | 1 | 4.2e-06 | 2.7e-06 | 4.2e-06 |
| GO:0006508 | proteolysis | 822 | 89 | 63.29 | 2 | 1.5e-05 | 3.1e-06 | 5.8e-06 |
| GO:0060271 | cilium assembly | 73 | 3 | 5.62 | 5 | 0.00022 | 1.3e-05 | 5.5e-07 |
| GO:0006355 | regulation of transcription, DNA-templat… | 621 | 44 | 47.81 | 3 | 3.2e-05 | 4.9e-05 | 3.2e-05 |
| GO:0060294 | cilium movement involved in cell motilit… | 7 | 1 | 0.54 | 4 | 0.00020 | 0.00020 | 0.00020 |
| GO:0007154 | cell communication | 1169 | 69 | 90.01 | 830 | 0.70082 | 0.00021 | 0.73799 |
| GO:0000079 | regulation of cyclin-dependent protein s… | 9 | 4 | 0.69 | 6 | 0.00030 | 0.00030 | 0.00030 |
| GO:0006813 | potassium ion transport | 103 | 12 | 7.93 | 7 | 0.00044 | 0.00047 | 0.00044 |
| GO:0005991 | trehalose metabolic process | 22 | 8 | 1.69 | 9 | 0.00107 | 0.00107 | 1.3e-05 |
| GO:0044271 | cellular nitrogen compound biosynthetic … | 1365 | 99 | 105.10 | 1483 | 0.99078 | 0.00252 | 0.98536 |
| GO:0035082 | axoneme assembly | 18 | 1 | 1.39 | 8 | 0.00047 | 0.00273 | 0.00047 |
| GO:0006520 | cellular amino acid metabolic process | 208 | 18 | 16.02 | 480 | 0.41000 | 0.00309 | 0.59686 |
| GO:0005992 | trehalose biosynthetic process | 7 | 4 | 0.54 | 11 | 0.00315 | 0.00315 | 0.00315 |
| GO:0006030 | chitin metabolic process | 103 | 18 | 7.93 | 14 | 0.00656 | 0.00601 | 0.00656 |
| GO:0005975 | carbohydrate metabolic process | 360 | 45 | 27.72 | 123 | 0.07878 | 0.00763 | 0.06199 |
| GO:0006979 | response to oxidative stress | 40 | 12 | 3.08 | 15 | 0.00790 | 0.00790 | 0.00790 |
| GO:0006367 | transcription initiation from RNA polyme… | 20 | 7 | 1.54 | 17 | 0.00828 | 0.00828 | 0.00828 |
| GO:0031146 | SCF-dependent proteasomal ubiquitin-depe… | 5 | 0 | 0.38 | 18 | 0.00893 | 0.00893 | 0.00893 |
| GO:0042246 | tissue regeneration | 6 | 2 | 0.46 | 20 | 0.01006 | 0.01006 | 0.01006 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term = Term(GOTERM[sigTerms]),
Definition = Definition(GOTERM[sigTerms]),
PValue=score(BP.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with',
VAR, 'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0007186 | G protein-coupled receptor signaling pathway | A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). | 0.0000027 |
| GO:0006508 | proteolysis | The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. | 0.0000031 |
| GO:0060271 | cilium assembly | The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. | 0.0000132 |
| GO:0006355 | regulation of transcription, DNA-templated | Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. | 0.0000488 |
| GO:0060294 | cilium movement involved in cell motility | Movement of cilia mediated by motor proteins that contributes to the movement of a cell. | 0.0001995 |
| GO:0000079 | regulation of cyclin-dependent protein serine/threonine kinase activity | Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. | 0.0003019 |
| GO:0006813 | potassium ion transport | The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. | 0.0004678 |
| GO:0005991 | trehalose metabolic process | The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. | 0.0010651 |
| GO:0035082 | axoneme assembly | The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. | 0.0027293 |
| GO:0005992 | trehalose biosynthetic process | The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. | 0.0031481 |
| GO:0006030 | chitin metabolic process | The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. | 0.0060109 |
| GO:0006979 | response to oxidative stress | Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of oxidative stress, a state often resulting from exposure to high levels of reactive oxygen species, e.g. superoxide anions, hydrogen peroxide (H2O2), and hydroxyl radicals. | 0.0079030 |
| GO:0006367 | transcription initiation from RNA polymerase II promoter | Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. | 0.0082808 |
| GO:0031146 | SCF-dependent proteasomal ubiquitin-dependent protein catabolic process | The chemical reactions and pathways resulting in the breakdown of a protein or peptide by hydrolysis of its peptide bonds, initiated by the covalent attachment of ubiquitin, with ubiquitin-protein ligation catalyzed by an SCF (Skp1/Cul1/F-box protein) complex, and mediated by the proteasome. | 0.0089260 |
I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.
showSigOfNodes(GOdata.BP, score(BP.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 166
## Number of Edges = 300
##
## $complete.dag
## [1] "A graph with 166 nodes."
This is just a example of the most significant GO term:
showGroupDensity(GOdata.BP, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdata.BP, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdata.BP, orderedTerms[3], rm.one=FALSE)
orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms
MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
write.table(MF.all, file=paste(TAG, VAR, 'MFsummary.tsv', sep='/'),
quote=FALSE, sep='\t', row.names=FALSE)
kable(MF.all,
caption = "Most over-represented terms of the Molecular Function ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0005509 | calcium ion binding | 638 | 61 | 48.86 | 1 | 3.0e-16 | 3.0e-16 | 3.0e-16 |
| GO:0043565 | sequence-specific DNA binding | 221 | 11 | 16.93 | 3 | 0.00035 | 5.7e-05 | 0.00035 |
| GO:0004930 | G protein-coupled receptor activity | 251 | 16 | 19.22 | 2 | 9.9e-06 | 5.8e-05 | 1.3e-06 |
| GO:0008417 | fucosyltransferase activity | 42 | 1 | 3.22 | 4 | 0.00041 | 0.00041 | 0.00041 |
| GO:0003774 | motor activity | 277 | 12 | 21.21 | 156 | 0.13853 | 0.00044 | 0.13853 |
| GO:0030248 | cellulose binding | 8 | 5 | 0.61 | 5 | 0.00054 | 0.00054 | 0.00054 |
| GO:0004555 | alpha,alpha-trehalase activity | 15 | 4 | 1.15 | 6 | 0.00113 | 0.00113 | 0.00113 |
| GO:0008134 | transcription factor binding | 33 | 3 | 2.53 | 20 | 0.00828 | 0.00118 | 0.00828 |
| GO:0005200 | structural constituent of cytoskeleton | 23 | 3 | 1.76 | 7 | 0.00121 | 0.00121 | 0.00121 |
| GO:0004181 | metallocarboxypeptidase activity | 39 | 5 | 2.99 | 8 | 0.00178 | 0.00178 | 0.00178 |
| GO:0016462 | pyrophosphatase activity | 952 | 68 | 72.91 | 210 | 0.23026 | 0.00297 | 0.68247 |
| GO:0004198 | calcium-dependent cysteine-type endopept… | 62 | 7 | 4.75 | 10 | 0.00301 | 0.00301 | 0.00301 |
| GO:0004252 | serine-type endopeptidase activity | 143 | 24 | 10.95 | 11 | 0.00329 | 0.00329 | 0.00329 |
| GO:0008061 | chitin binding | 112 | 16 | 8.58 | 14 | 0.00393 | 0.00393 | 0.00393 |
| GO:0004601 | peroxidase activity | 59 | 12 | 4.52 | 93 | 0.06946 | 0.00397 | 0.06946 |
| GO:0004611 | phosphoenolpyruvate carboxykinase activi… | 7 | 1 | 0.54 | 15 | 0.00440 | 0.00440 | 0.00440 |
| GO:0004983 | neuropeptide Y receptor activity | 11 | 0 | 0.84 | 18 | 0.00670 | 0.00670 | 0.00670 |
| GO:0005267 | potassium channel activity | 80 | 10 | 6.13 | 12 | 0.00353 | 0.00684 | 0.00353 |
| GO:0008047 | enzyme activator activity | 135 | 10 | 10.34 | 109 | 0.08050 | 0.00834 | 0.08050 |
| GO:0030246 | carbohydrate binding | 83 | 17 | 6.36 | 28 | 0.01315 | 0.01064 | 0.04417 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MF.weight01)[sigTerms]),
caption = paste('Molecular function terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0005509 | calcium ion binding | Interacting selectively and non-covalently with calcium ions (Ca2+). | 0.0000000 |
| GO:0043565 | sequence-specific DNA binding | Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. | 0.0000573 |
| GO:0004930 | G protein-coupled receptor activity | Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. | 0.0000582 |
| GO:0008417 | fucosyltransferase activity | Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. | 0.0004103 |
| GO:0030248 | cellulose binding | Interacting selectively and non-covalently with cellulose. | 0.0005392 |
| GO:0004555 | alpha,alpha-trehalase activity | Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. | 0.0011331 |
| GO:0008134 | transcription factor binding | Interacting selectively and non-covalently with a transcription factor, any protein required to initiate or regulate transcription. | 0.0011813 |
| GO:0005200 | structural constituent of cytoskeleton | The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. | 0.0012128 |
| GO:0004181 | metallocarboxypeptidase activity | Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. | 0.0017817 |
| GO:0004198 | calcium-dependent cysteine-type endopeptidase activity | Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. | 0.0030053 |
| GO:0004252 | serine-type endopeptidase activity | Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). | 0.0032931 |
| GO:0008061 | chitin binding | Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. | 0.0039312 |
| GO:0004611 | phosphoenolpyruvate carboxykinase activity | Catalysis of the reaction: source of phosphate + oxaloacetate = phosphoenolpyruvate + CO2 + other reaction products. | 0.0043958 |
| GO:0004983 | neuropeptide Y receptor activity | Combining with neuropeptide Y to initiate a change in cell activity. | 0.0067048 |
| GO:0005267 | potassium channel activity | Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. | 0.0068424 |
showSigOfNodes(GOdata.MF, score(MF.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 86
## Number of Edges = 102
##
## $complete.dag
## [1] "A graph with 86 nodes."
showGroupDensity(GOdata.MF, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdata.MF, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdata.MF, orderedTerms[3], rm.one=FALSE)
orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms
CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
write.table(CC.all, file=paste(TAG, VAR, 'CCsummary.tsv', sep='/'),
quote=FALSE, sep='\t', row.names=FALSE)
kable(CC.all,
caption = "Most over-represented terms of the Cellular Component ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0005576 | extracellular region | 197 | 38 | 15.03 | 1 | 1.7e-06 | 3.3e-05 | 1.7e-06 |
| GO:0005743 | mitochondrial inner membrane | 44 | 12 | 3.36 | 2 | 4.8e-05 | 0.00032 | 4.8e-05 |
| GO:0016459 | myosin complex | 57 | 3 | 4.35 | 3 | 0.00036 | 0.00036 | 0.00036 |
| GO:0005929 | cilium | 52 | 3 | 3.97 | 6 | 0.00071 | 0.00082 | 0.00071 |
| GO:0016021 | integral component of membrane | 1739 | 155 | 132.66 | 4 | 0.00046 | 0.00236 | 8.1e-05 |
| GO:0005886 | plasma membrane | 324 | 34 | 24.72 | 7 | 0.00082 | 0.00445 | 0.00097 |
| GO:0090575 | RNA polymerase II transcription factor c… | 47 | 11 | 3.59 | 31 | 0.02538 | 0.00503 | 0.02538 |
| GO:0098800 | inner mitochondrial membrane protein com… | 27 | 7 | 2.06 | 26 | 0.02146 | 0.00526 | 0.02146 |
| GO:0042555 | MCM complex | 12 | 5 | 0.92 | 11 | 0.00624 | 0.00624 | 0.00624 |
| GO:0099512 | supramolecular fiber | 52 | 6 | 3.97 | 9 | 0.00344 | 0.00766 | 0.00344 |
| GO:0031225 | anchored component of membrane | 8 | 4 | 0.61 | 13 | 0.00813 | 0.00813 | 0.00813 |
| GO:0008076 | voltage-gated potassium channel complex | 30 | 5 | 2.29 | 14 | 0.01150 | 0.01150 | 0.01150 |
| GO:0016020 | membrane | 3059 | 260 | 233.36 | 30 | 0.02483 | 0.01168 | 2.1e-07 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CC.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0005576 | extracellular region | The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. | 0.0000329 |
| GO:0005743 | mitochondrial inner membrane | The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. | 0.0003155 |
| GO:0016459 | myosin complex | A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. | 0.0003574 |
| GO:0005929 | cilium | A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. | 0.0008211 |
| GO:0016021 | integral component of membrane | The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. | 0.0023586 |
| GO:0005886 | plasma membrane | The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. | 0.0044522 |
| GO:0042555 | MCM complex | A hexameric protein complex required for the initiation and regulation of DNA replication. | 0.0062379 |
| GO:0099512 | supramolecular fiber | A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. | 0.0076615 |
| GO:0031225 | anchored component of membrane | The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. | 0.0081261 |
showSigOfNodes(GOdata.CC, score(CC.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 63
## Number of Edges = 110
##
## $complete.dag
## [1] "A graph with 63 nodes."
showGroupDensity(GOdata.CC, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdata.CC, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdata.CC, orderedTerms[3], rm.one=FALSE)
I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by regime. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to \(p\)-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by regime to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by regime.
selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
| ontology | Num_Genes | Num_GO_terms |
|---|---|---|
| BP | 26235 | 1582 |
| MF | 30580 | 791 |
| CC | 21616 | 390 |
BPvar.elim <- runTest(GOdataVar.BP, algorithm = "elim", statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea <- runTest(GOdataVar.BP, algorithm = "lea", statistic = "ks")
MFvar.elim <- runTest(GOdataVar.MF, algorithm = "elim", statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea <- runTest(GOdataVar.MF, algorithm = "lea", statistic = "ks")
CCvar.elim <- runTest(GOdataVar.CC, algorithm = "elim", statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea <- runTest(GOdataVar.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
| ontology | algorithm | TermsTested | Significant |
|---|---|---|---|
| BP | elim | 1582 | 22 |
| BP | weight01 | 1582 | 23 |
| BP | lea | 1582 | 29 |
| MF | elim | 791 | 29 |
| MF | weight01 | 791 | 27 |
| MF | lea | 791 | 39 |
| CC | elim | 390 | 11 |
| CC | weight01 | 390 | 12 |
| CC | lea | 390 | 19 |
rm(ResultsSummary)
orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms
BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(BPvar.all,
caption = "Most over-represented terms of the Biological Process ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0007186 | G protein-coupled receptor signaling pat… | 302 | 44 | 29.72 | 1 | 9.7e-07 | 3.4e-08 | 9.7e-07 |
| GO:0006508 | proteolysis | 822 | 121 | 80.90 | 5 | 0.00019 | 1.5e-07 | 1.2e-05 |
| GO:0060271 | cilium assembly | 73 | 11 | 7.18 | 3 | 7.2e-05 | 5.7e-05 | 5.0e-07 |
| GO:0006355 | regulation of transcription, DNA-templat… | 621 | 65 | 61.12 | 2 | 6.8e-05 | 0.00019 | 6.8e-05 |
| GO:0006813 | potassium ion transport | 103 | 14 | 10.14 | 4 | 7.5e-05 | 0.00046 | 7.5e-05 |
| GO:0005992 | trehalose biosynthetic process | 7 | 4 | 0.69 | 7 | 0.00064 | 0.00064 | 0.00064 |
| GO:0060294 | cilium movement involved in cell motilit… | 7 | 4 | 0.69 | 8 | 0.00067 | 0.00067 | 0.00067 |
| GO:0006030 | chitin metabolic process | 103 | 19 | 10.14 | 6 | 0.00034 | 0.00071 | 0.00034 |
| GO:0006520 | cellular amino acid metabolic process | 208 | 18 | 20.47 | 719 | 0.66735 | 0.00075 | 0.65245 |
| GO:0007160 | cell-matrix adhesion | 24 | 8 | 2.36 | 9 | 0.00075 | 0.00075 | 0.00075 |
| GO:0006801 | superoxide metabolic process | 20 | 1 | 1.97 | 11 | 0.00107 | 0.00107 | 0.00107 |
| GO:0000079 | regulation of cyclin-dependent protein s… | 9 | 5 | 0.89 | 12 | 0.00112 | 0.00112 | 0.00112 |
| GO:0005975 | carbohydrate metabolic process | 360 | 56 | 35.43 | 52 | 0.02497 | 0.00167 | 0.01764 |
| GO:0006814 | sodium ion transport | 108 | 13 | 10.63 | 13 | 0.00167 | 0.00167 | 0.00167 |
| GO:0007154 | cell communication | 1169 | 107 | 115.05 | 799 | 0.73112 | 0.00185 | 0.74893 |
| GO:0005991 | trehalose metabolic process | 22 | 9 | 2.17 | 14 | 0.00188 | 0.00188 | 5.4e-06 |
| GO:0044271 | cellular nitrogen compound biosynthetic … | 1365 | 104 | 134.34 | 1469 | 0.99399 | 0.00189 | 0.94722 |
| GO:0006821 | chloride transport | 15 | 1 | 1.48 | 10 | 0.00081 | 0.00212 | 0.00081 |
| GO:0006367 | transcription initiation from RNA polyme… | 20 | 8 | 1.97 | 15 | 0.00272 | 0.00272 | 0.00272 |
| GO:0035082 | axoneme assembly | 18 | 4 | 1.77 | 16 | 0.00303 | 0.00327 | 0.00303 |
| GO:0071805 | potassium ion transmembrane transport | 20 | 3 | 1.97 | 28 | 0.01748 | 0.00528 | 0.01748 |
| GO:0046173 | polyol biosynthetic process | 8 | 3 | 0.79 | 218 | 0.14874 | 0.00618 | 0.14874 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(BPvar.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0007186 | G protein-coupled receptor signaling pathway | A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). | 0.0000000 |
| GO:0006508 | proteolysis | The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. | 0.0000001 |
| GO:0060271 | cilium assembly | The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. | 0.0000567 |
| GO:0006355 | regulation of transcription, DNA-templated | Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. | 0.0001890 |
| GO:0006813 | potassium ion transport | The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. | 0.0004553 |
| GO:0005992 | trehalose biosynthetic process | The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. | 0.0006394 |
| GO:0060294 | cilium movement involved in cell motility | Movement of cilia mediated by motor proteins that contributes to the movement of a cell. | 0.0006687 |
| GO:0006030 | chitin metabolic process | The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. | 0.0007143 |
| GO:0007160 | cell-matrix adhesion | The binding of a cell to the extracellular matrix via adhesion molecules. | 0.0007536 |
| GO:0006801 | superoxide metabolic process | The chemical reactions and pathways involving superoxide, the superoxide anion O2- (superoxide free radical), or any compound containing this species. | 0.0010662 |
| GO:0000079 | regulation of cyclin-dependent protein serine/threonine kinase activity | Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. | 0.0011196 |
| GO:0006814 | sodium ion transport | The directed movement of sodium ions (Na+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. | 0.0016691 |
| GO:0005991 | trehalose metabolic process | The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. | 0.0018778 |
| GO:0006821 | chloride transport | The directed movement of chloride into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. | 0.0021196 |
| GO:0006367 | transcription initiation from RNA polymerase II promoter | Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. | 0.0027247 |
| GO:0035082 | axoneme assembly | The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. | 0.0032734 |
| GO:0042246 | tissue regeneration | The regrowth of lost or destroyed tissues. | 0.0084031 |
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 174
## Number of Edges = 306
##
## $complete.dag
## [1] "A graph with 174 nodes."
Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.BP, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdataVar.BP, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdataVar.BP, orderedTerms[3], rm.one=FALSE)
orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms
MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(MFvar.all,
caption = "Most over-represented terms of the Molecular Function ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0005509 | calcium ion binding | 638 | 116 | 64.38 | 1 | 4.6e-18 | 4.6e-18 | 4.6e-18 |
| GO:0004930 | G protein-coupled receptor activity | 251 | 39 | 25.33 | 2 | 5.2e-07 | 3.1e-07 | 1.2e-08 |
| GO:0043565 | sequence-specific DNA binding | 221 | 23 | 22.30 | 3 | 7.6e-05 | 3.5e-06 | 7.6e-05 |
| GO:0003774 | motor activity | 277 | 50 | 27.95 | 16 | 0.00499 | 9.0e-05 | 0.00499 |
| GO:0004252 | serine-type endopeptidase activity | 143 | 31 | 14.43 | 4 | 0.00035 | 0.00035 | 0.00035 |
| GO:0008061 | chitin binding | 112 | 18 | 11.30 | 5 | 0.00054 | 0.00054 | 0.00054 |
| GO:0030248 | cellulose binding | 8 | 2 | 0.81 | 6 | 0.00085 | 0.00085 | 0.00085 |
| GO:0004181 | metallocarboxypeptidase activity | 39 | 10 | 3.94 | 7 | 0.00091 | 0.00091 | 0.00091 |
| GO:0008417 | fucosyltransferase activity | 42 | 3 | 4.24 | 10 | 0.00149 | 0.00149 | 0.00149 |
| GO:0004555 | alpha,alpha-trehalase activity | 15 | 5 | 1.51 | 11 | 0.00220 | 0.00220 | 0.00220 |
| GO:0016462 | pyrophosphatase activity | 952 | 109 | 96.07 | 491 | 0.76140 | 0.00244 | 0.85685 |
| GO:0005267 | potassium channel activity | 80 | 13 | 8.07 | 8 | 0.00102 | 0.00282 | 0.00102 |
| GO:0016651 | oxidoreductase activity, acting on NAD(P… | 47 | 5 | 4.74 | 116 | 0.09384 | 0.00431 | 0.09384 |
| GO:0005200 | structural constituent of cytoskeleton | 23 | 10 | 2.32 | 13 | 0.00467 | 0.00467 | 0.00467 |
| GO:0004553 | hydrolase activity, hydrolyzing O-glycos… | 169 | 33 | 17.05 | 40 | 0.02098 | 0.00480 | 0.02098 |
| GO:0005247 | voltage-gated chloride channel activity | 7 | 0 | 0.71 | 17 | 0.00501 | 0.00501 | 0.00501 |
| GO:0004040 | amidase activity | 5 | 4 | 0.50 | 18 | 0.00552 | 0.00552 | 0.00552 |
| GO:0004198 | calcium-dependent cysteine-type endopept… | 62 | 13 | 6.26 | 19 | 0.00594 | 0.00594 | 0.00594 |
| GO:0031409 | pigment binding | 11 | 4 | 1.11 | 20 | 0.00602 | 0.00602 | 0.00602 |
| GO:0030246 | carbohydrate binding | 83 | 17 | 8.38 | 46 | 0.02647 | 0.00627 | 0.04503 |
| GO:0005507 | copper ion binding | 29 | 8 | 2.93 | 21 | 0.00646 | 0.00646 | 0.00646 |
| GO:0140098 | catalytic activity, acting on RNA | 278 | 12 | 28.05 | 544 | 0.83730 | 0.00709 | 0.99630 |
| GO:0005272 | sodium channel activity | 83 | 11 | 8.38 | 22 | 0.00722 | 0.00722 | 0.00722 |
| GO:0051015 | actin filament binding | 47 | 12 | 4.74 | 24 | 0.00740 | 0.00740 | 0.00740 |
| GO:0005201 | extracellular matrix structural constitu… | 7 | 3 | 0.71 | 27 | 0.00901 | 0.00901 | 0.00901 |
| GO:0008083 | growth factor activity | 8 | 4 | 0.81 | 28 | 0.00918 | 0.00918 | 0.00918 |
| GO:0008199 | ferric iron binding | 15 | 4 | 1.51 | 29 | 0.00992 | 0.00992 | 0.00992 |
| GO:0004601 | peroxidase activity | 59 | 16 | 5.95 | 23 | 0.00732 | 0.01186 | 0.00732 |
| GO:0005249 | voltage-gated potassium channel activity | 54 | 8 | 5.45 | 31 | 0.01347 | 0.01252 | 0.01347 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MFvar.weight01)[sigTerms]),
caption = paste('Molecular functions terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0005509 | calcium ion binding | Interacting selectively and non-covalently with calcium ions (Ca2+). | 0.0000000 |
| GO:0004930 | G protein-coupled receptor activity | Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. | 0.0000003 |
| GO:0043565 | sequence-specific DNA binding | Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. | 0.0000035 |
| GO:0003774 | motor activity | Catalysis of the generation of force resulting either in movement along a microfilament or microtubule, or in torque resulting in membrane scission, coupled to the hydrolysis of a nucleoside triphosphate. | 0.0000895 |
| GO:0004252 | serine-type endopeptidase activity | Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). | 0.0003473 |
| GO:0008061 | chitin binding | Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. | 0.0005355 |
| GO:0030248 | cellulose binding | Interacting selectively and non-covalently with cellulose. | 0.0008517 |
| GO:0004181 | metallocarboxypeptidase activity | Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. | 0.0009053 |
| GO:0008417 | fucosyltransferase activity | Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. | 0.0014917 |
| GO:0004555 | alpha,alpha-trehalase activity | Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. | 0.0021975 |
| GO:0005267 | potassium channel activity | Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. | 0.0028229 |
| GO:0005200 | structural constituent of cytoskeleton | The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. | 0.0046674 |
| GO:0005247 | voltage-gated chloride channel activity | Enables the transmembrane transfer of a chloride ion by a voltage-gated channel. A voltage-gated channel is a channel whose open state is dependent on the voltage across the membrane in which it is embedded. | 0.0050061 |
| GO:0004040 | amidase activity | Catalysis of the reaction: a monocarboxylic acid amide + H2O = a monocarboxylate + NH3. | 0.0055211 |
| GO:0004198 | calcium-dependent cysteine-type endopeptidase activity | Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. | 0.0059422 |
| GO:0031409 | pigment binding | Interacting selectively and non-covalently with a pigment, any general or particular coloring matter in living organisms, e.g. melanin. | 0.0060164 |
| GO:0005507 | copper ion binding | Interacting selectively and non-covalently with copper (Cu) ions. | 0.0064552 |
| GO:0005272 | sodium channel activity | Enables the facilitated diffusion of a sodium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. | 0.0072212 |
| GO:0051015 | actin filament binding | Interacting selectively and non-covalently with an actin filament, also known as F-actin, a helical filamentous polymer of globular G-actin subunits. | 0.0073983 |
| GO:0005201 | extracellular matrix structural constituent | The action of a molecule that contributes to the structural integrity of the extracellular matrix. | 0.0090108 |
| GO:0008083 | growth factor activity | The function that stimulates a cell to grow or proliferate. Most growth factors have other actions besides the induction of cell growth or proliferation. | 0.0091849 |
| GO:0008199 | ferric iron binding | Interacting selectively and non-covalently with ferric iron, Fe(III). | 0.0099227 |
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 109
## Number of Edges = 137
##
## $complete.dag
## [1] "A graph with 109 nodes."
I plot variance portion, but for the term found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.MF, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdataVar.MF, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdataVar.MF, orderedTerms[3], rm.one=FALSE)
orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms
CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))
kable(CCvar.all,
caption = "Most over-represented terms of the Cellular Component ontology.")
| GO.ID | Term | Annotated | Significant | Expected | Rank in elim | elim | weight01 | lea |
|---|---|---|---|---|---|---|---|---|
| GO:0005576 | extracellular region | 197 | 39 | 19.49 | 1 | 2.0e-06 | 1.9e-05 | 2.0e-06 |
| GO:0005743 | mitochondrial inner membrane | 44 | 11 | 4.35 | 3 | 5.5e-05 | 2.8e-05 | 5.5e-05 |
| GO:0016459 | myosin complex | 57 | 21 | 5.64 | 4 | 7.0e-05 | 7.0e-05 | 7.0e-05 |
| GO:0005929 | cilium | 52 | 15 | 5.15 | 5 | 0.0014 | 9.5e-05 | 0.00017 |
| GO:0016021 | integral component of membrane | 1739 | 215 | 172.08 | 2 | 3.4e-05 | 0.00024 | 4.7e-06 |
| GO:0016020 | membrane | 3059 | 337 | 302.70 | 11 | 0.0089 | 0.00231 | 2.1e-07 |
| GO:0090575 | RNA polymerase II transcription factor c… | 47 | 6 | 4.65 | 83 | 0.2041 | 0.00294 | 0.20411 |
| GO:0005891 | voltage-gated calcium channel complex | 11 | 4 | 1.09 | 6 | 0.0031 | 0.00310 | 0.00310 |
| GO:0005886 | plasma membrane | 324 | 40 | 32.06 | 26 | 0.0296 | 0.00417 | 0.00160 |
| GO:0099512 | supramolecular fiber | 52 | 14 | 5.15 | 9 | 0.0076 | 0.00469 | 0.00764 |
| GO:0008076 | voltage-gated potassium channel complex | 30 | 6 | 2.97 | 8 | 0.0055 | 0.00552 | 0.00552 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))
kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CCvar.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
| Term | Definition | PValue | |
|---|---|---|---|
| GO:0005576 | extracellular region | The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. | 0.0000195 |
| GO:0005743 | mitochondrial inner membrane | The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. | 0.0000282 |
| GO:0016459 | myosin complex | A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. | 0.0000700 |
| GO:0005929 | cilium | A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. | 0.0000950 |
| GO:0016021 | integral component of membrane | The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. | 0.0002372 |
| GO:0016020 | membrane | A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it. | 0.0023057 |
| GO:0005891 | voltage-gated calcium channel complex | A protein complex that forms a transmembrane channel through which calcium ions may pass in response to changes in membrane potential. | 0.0031048 |
| GO:0099512 | supramolecular fiber | A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. | 0.0046859 |
| GO:0008076 | voltage-gated potassium channel complex | A protein complex that forms a transmembrane channel through which potassium ions may cross a cell membrane in response to changes in membrane potential. | 0.0055216 |
| GO:0005581 | collagen trimer | A protein complex consisting of three collagen chains assembled into a left-handed triple helix. These trimers typically assemble into higher order structures. | 0.0086510 |
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
firstSigNodes = sum(significant.weight01),
wantedNodes = sigTerms)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 61
## Number of Edges = 103
##
## $complete.dag
## [1] "A graph with 61 nodes."
For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.
showGroupDensity(GOdataVar.CC, orderedTerms[1], rm.one=FALSE)
showGroupDensity(GOdataVar.CC, orderedTerms[2], rm.one=FALSE)
showGroupDensity(GOdataVar.CC, orderedTerms[3], rm.one=FALSE)
allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% BP.pvalue.sigTerms,
Variance = allTerms %in% BP.variance.sigTerms)))
ggplot(data = data.frame(PValue = rank(score(BP.weight01))[allTerms],
Variance = rank(score(BPvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth(method='lm') + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of BP terms by significance')
allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% MF.pvalue.sigTerms,
Variance = allTerms %in% MF.variance.sigTerms)))
ggplot(data = data.frame(PValue = rank(score(MF.weight01))[allTerms],
Variance = rank(score(MFvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth(methdo='lm') + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of MF terms by significance')
## Warning: Ignoring unknown parameters: methdo
allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% CC.pvalue.sigTerms,
Variance = allTerms %in% CC.variance.sigTerms)))
ggplot(data = data.frame(PValue = rank(score(CC.weight01))[allTerms],
Variance = rank(score(CCvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth(method='lm') + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of CC terms by significance')
I save the main variables to be able to load them in a new R session later.
save(allgenes2GO,
GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8
## [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8
## [7] LC_PAPER=ca_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Rgraphviz_2.30.0 ggplot2_3.3.0 limma_3.42.2
## [4] knitr_1.28 topGO_2.38.1 SparseM_1.78
## [7] GO.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.2
## [10] S4Vectors_0.24.3 Biobase_2.46.0 graph_1.64.0
## [13] BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.0.0 xfun_0.12 purrr_0.3.3 splines_3.6.3
## [5] lattice_0.20-41 colorspace_1.4-1 vctrs_0.2.4 htmltools_0.4.0
## [9] yaml_2.2.1 mgcv_1.8-33 blob_1.2.1 rlang_0.4.5
## [13] pillar_1.4.3 glue_1.4.0 withr_2.1.2 DBI_1.1.0
## [17] bit64_0.9-7 matrixStats_0.55.0 lifecycle_0.2.0 stringr_1.4.0
## [21] munsell_0.5.0 gtable_0.3.0 memoise_1.1.0 evaluate_0.14
## [25] labeling_0.3 fansi_0.4.1 highr_0.8 Rcpp_1.0.4
## [29] scales_1.1.0 farver_2.0.3 bit_1.1-15.2 digest_0.6.25
## [33] stringi_1.4.6 dplyr_0.8.5 cli_2.0.2 tools_3.6.3
## [37] magrittr_1.5 RSQLite_2.2.0 tibble_3.0.0 crayon_1.3.4
## [41] pkgconfig_2.0.3 Matrix_1.2-18 ellipsis_0.3.0 assertthat_0.2.1
## [45] rmarkdown_2.1 R6_2.4.1 nlme_3.1-149 compiler_3.6.3